feat: improve french text normalization with number conversion and contraction expansion #13
Karamouche merged 10 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
⛔ Files ignored due to path filters (2)
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Adds French number normalization using text2num.alpha2digit with a pre-pass for single-digit + large-unit patterns, expands French replacement maps and operator methods (contractions, written-number expansion), adds sentence-level replacements, tests, a pyproject dependency, and a package-data entry.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Caller
    rect rgba(100,150,240,0.5)
        participant FrenchOperators
    end
    rect rgba(100,200,120,0.5)
        participant FrenchNumberNormalizer
    end
    rect rgba(220,120,120,0.5)
        participant text2num
    end
    Caller->>FrenchOperators: process text
    FrenchOperators->>FrenchNumberNormalizer: expand_written_numbers(text)
    FrenchNumberNormalizer->>FrenchNumberNormalizer: pre-pass rewrite (e.g., "3 milliards" → "trois milliards")
    FrenchNumberNormalizer->>text2num: alpha2digit(text, "fr")
    text2num-->>FrenchNumberNormalizer: normalized text
    FrenchNumberNormalizer-->>FrenchOperators: normalized text
    FrenchOperators-->>Caller: final normalized text
```
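The pre-pass step in the diagram can be sketched with stdlib regex alone. Everything below (the `DIGIT_TO_FRENCH` map, the pattern, the `prepass` name) is an illustrative assumption mirroring the walkthrough, not the repository's actual `number_normalizer.py`:

```python
import re

# Illustrative digit-to-word map; the walkthrough implies the real module
# keeps an equivalent table for its single-digit pre-pass.
DIGIT_TO_FRENCH = {
    "0": "zéro", "1": "un", "2": "deux", "3": "trois", "4": "quatre",
    "5": "cinq", "6": "six", "7": "sept", "8": "huit", "9": "neuf",
}

# A single digit followed by a large scale word ("3 milliards", "7 millions", ...)
MIXED_NUMBER = re.compile(r"\b(\d)\s+(millions?|milliards?|billions?|trillions?)\b")

def prepass(text: str) -> str:
    """Rewrite '3 milliards' as 'trois milliards' so that a later
    alpha2digit(text, 'fr') call sees a purely written-out number."""
    return MIXED_NUMBER.sub(
        lambda m: f"{DIGIT_TO_FRENCH[m.group(1)]} {m.group(2)}", text
    )

print(prepass("le budget est de 3 milliards d'euros"))
# → le budget est de trois milliards d'euros
```

Note that the `\b(\d)\s+` prefix only matches an isolated single digit, so multi-digit inputs like "12 milliards" pass through untouched.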
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (2)
normalization/languages/french/replacements.py (1)
1-28: Add type annotation for consistency.

The FRENCH_REPLACEMENTS dictionary is missing an explicit type annotation, unlike FRENCH_SENTENCE_REPLACEMENTS, which has dict[str, str]. Adding this improves consistency and helps static type checkers.

♻️ Suggested fix

-FRENCH_REPLACEMENTS = {
+FRENCH_REPLACEMENTS: dict[str, str] = {
     # contractions in titles/prefixes
     "mme": "madame",

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/replacements.py` around lines 1 - 28, The FRENCH_REPLACEMENTS dict lacks an explicit type annotation; add the same annotation used for FRENCH_SENTENCE_REPLACEMENTS (dict[str, str]) to FRENCH_REPLACEMENTS to make types consistent and satisfy static type checkers, updating the declaration of FRENCH_REPLACEMENTS accordingly.

normalization/languages/french/number_normalizer.py (1)
30-52: Consider adding "mille" (thousand) and "cent" (hundred) to the pattern.

Currently, the pattern only handles millions?|milliards?|billions?|trillions?, but French also uses mille and cent as scale words. The same concatenation issue that affects "3 milliards" → "31e9" could theoretically apply to "3 mille" and "3 cent". Since both are already in FRENCH_CONFIG.number_words and the fix (single-digit-to-word conversion) is identical to the existing logic, extending the pattern would be consistent:

r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?|mille|cent)\b"

If alpha2digit reliably handles these cases without pre-normalization, the current implementation is fine; otherwise, this extension ensures uniform handling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/number_normalizer.py` around lines 30 - 52, The _RE_MIXED_NUMBER regex and _normalize_mixed_numbers function need to cover French "mille" and "cent" scales too; update the pattern defined in _RE_MIXED_NUMBER to include "mille" and "cent" (and optional plural form for "cent" if desired) so single-digit numbers like "3 mille" or "3 cent" are converted to words via _DIGIT_TO_FRENCH before alpha2digit runs, preserving the existing replace logic in _normalize_mixed_numbers unchanged.
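A quick stdlib check shows which phrases the reviewer's extended pattern would now capture (purely an illustration of the suggestion, not code from the PR):

```python
import re

# The reviewer's proposed extension: also treat "mille" and "cent" as scale words.
EXTENDED = re.compile(
    r"\b(\d+)\s+(millions?|milliards?|billions?|trillions?|mille|cent)\b"
)

for sample in ["3 mille", "3 cent", "3 milliards", "3 km"]:
    print(f"{sample!r}: {'matches' if EXTENDED.search(sample) else 'no match'}")
```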
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 115-117: The negative lookahead used to prevent expansions before
vowels+`h` is wrong because the `vowels` string (variable name `vowels`) omits
`'h'`, so the regex `_V = rf"(?![{vowels}{vowels.upper()}])"` will not block
elisions like "j'homme" or "l'heure"; fix by adding `'h'` (and optionally `'H'`
via the existing uppercasing) to the `vowels` string so that `_V` correctly
includes `h` in its character class, leaving the comment "Vowels + h" consistent
with the code.
---
Nitpick comments:
In `@normalization/languages/french/number_normalizer.py`:
- Around line 30-52: The _RE_MIXED_NUMBER regex and _normalize_mixed_numbers
function need to cover French "mille" and "cent" scales too; update the pattern
defined in _RE_MIXED_NUMBER to include "mille" and "cent" (and optional plural
form for "cent" if desired) so single-digit numbers like "3 mille" or "3 cent"
are converted to words via _DIGIT_TO_FRENCH before alpha2digit runs, preserving
the existing replace logic in _normalize_mixed_numbers unchanged.
In `@normalization/languages/french/replacements.py`:
- Around line 1-28: The FRENCH_REPLACEMENTS dict lacks an explicit type
annotation; add the same annotation used for FRENCH_SENTENCE_REPLACEMENTS
(dict[str, str]) to FRENCH_REPLACEMENTS to make types consistent and satisfy
static type checkers, updating the declaration of FRENCH_REPLACEMENTS
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 351ddfe9-43d4-424e-b843-18029e918670
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
- normalization/languages/french/number_normalizer.py
- normalization/languages/french/operators.py
- normalization/languages/french/replacements.py
- normalization/languages/french/sentence_replacements.py
- pyproject.toml
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
- tests/unit/steps/text/conftest.py
- tests/unit/steps/text/expand_contractions_test.py
0f9970b to 6ae7563 (compare)
Actionable comments posted: 2
♻️ Duplicate comments (1)
normalization/languages/french/operators.py (1)
115-117: ⚠️ Potential issue | 🟠 Major — The apostrophe guard still expands before h.

The comment says "vowels + h", but h is still missing from vowels, so inputs like j'habite and l'heure can be rewritten to je habite and le heure. This is the same bug that was raised on the previous revision, and it is still present.

🐛 Minimal fix

- vowels = "aàâeéèêiîïoôuùûy"
+ vowels = "aàâeéèêiîïoôuùûyh"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 115 - 117, The guard for elision is missing 'h' in the vowels set: update the vowels variable used to build _V in normalization/languages/french/operators.py so it includes 'h' (and 'H' via the existing upper() usage) to match the comment "Vowels + h"; ensure the _V regex construction (the variable _V) continues to use that updated vowels string so elision before h (e.g., "j'habite", "l'heure") will not be expanded.
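The intent of that guard can be shown in isolation. The snippet below reconstructs the lookahead from the review text (the names vowels and _V follow the review; the repository's surrounding code may differ):

```python
import re

# Vowels + h: standard elisions like "j'habite" or "j'ai" must be preserved,
# while informal consonant contractions like "j'vais" get expanded.
vowels = "aàâeéèêiîïoôuùûyh"
_V = rf"(?![{vowels}{vowels.upper()}])"  # expand only when NOT followed by a vowel or h

def expand_je(text: str) -> str:
    return re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)

print(expand_je("j'vais au marché"))  # → je vais au marché
print(expand_je("j'habite à Paris"))  # → j'habite à Paris (untouched thanks to the h)
```

Without the trailing h in the character class, the second call would wrongly produce "je habite à Paris", which is exactly the bug the review flags.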
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/french/operators.py`:
- Around line 55-92: The number_words list is missing common hyphenated French
numerals; update the number_words variable to include standard hyphenated forms
such as "dix-sept", "dix-huit", "dix-neuf", "soixante-dix", "quatre-vingt",
"quatre-vingts", and "quatre-vingt-dix" so the detection covers ordinary French
spellings (modify the number_words array in the French operators module).
- Around line 119-126: The generic expansions for c' and l' (the two re.sub
calls matching rf"\bc'{_V}" and rf"\bl'{_V}") are unsafe and produce
ungrammatical output (e.g., "c'pas" -> "ce pas", "l'voiture" -> "le voiture");
remove or restrict these two rules: either delete/comment out the re.sub lines
for c' and l' in operators.py, or replace them with narrowly scoped expansions
that only match known safe contractions (for example only expand "c'est"/"c'et"
to "ce " and avoid expanding arbitrary c' + _V, and do not expand l' at all).
Keep the other contractions (d', qu', n', s', m', t') unchanged and continue
using the same text variable and _V pattern.
---
Duplicate comments:
In `@normalization/languages/french/operators.py`:
- Around line 115-117: The guard for elision is missing 'h' in the vowels set:
update the vowels variable used to build _V in
normalization/languages/french/operators.py so it includes 'h' (and 'H' via the
existing upper() usage) to match the comment "Vowels + h"; ensure the _V regex
construction (the variable _V) continues to use that updated vowels string so
elision before h (e.g., "j'habite", "l'heure") will not be expanded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7fa84fd4-88f8-4b61-a368-d649ce4796c0
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (8)
- normalization/languages/french/number_normalizer.py
- normalization/languages/french/operators.py
- normalization/languages/french/replacements.py
- normalization/languages/french/sentence_replacements.py
- pyproject.toml
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
- tests/unit/steps/text/conftest.py
- tests/unit/steps/text/expand_contractions_test.py
✅ Files skipped from review due to trivial changes (2)
- normalization/languages/french/sentence_replacements.py
- normalization/languages/french/replacements.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unit/steps/text/conftest.py
- pyproject.toml
- tests/unit/steps/text/expand_contractions_test.py
- normalization/languages/french/number_normalizer.py
- tests/unit/steps/text/apply_sentence_level_replacements_test.py
♻️ Duplicate comments (2)
normalization/languages/french/operators.py (2)
55-92: ⚠️ Potential issue | 🟠 Major — Add standard hyphenated French numerals to number_words.

Lines 55–92 still miss common standard forms (dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, quatre-vingt-dix), so config-driven number-word detection can skip ordinary spellings.

Suggested patch

 number_words=[
     "zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept",
     "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze",
     "quinze", "seize",
+    "dix-sept",
+    "dix-huit",
+    "dix-neuf",
     "vingt", "trente", "quarante", "cinquante", "soixante",
+    "soixante-dix",
     "septante", "octante", "huitante", "nonante",
+    "quatre-vingt",
+    "quatre-vingts",
+    "quatre-vingt-dix",
     "cent",

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 55 - 92, The number_words list is missing standard hyphenated French numerals which will cause ordinary spellings to be missed; update the number_words list in normalization/languages/french/operators.py (the number_words variable) to include the common hyphenated forms: dix-sept, dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and quatre-vingt-dix (and any plural or variant forms you need) so the parser recognizes these standard spellings; keep entries as lowercase strings consistent with the existing list.
119-126: ⚠️ Potential issue | 🟠 Major — Avoid generic expansion for c' and l'.

Line 119 and line 126 can produce ungrammatical output (c'pas → ce pas, l'voiture → le voiture). These two rules are too ambiguous for unconditional expansion.

Safer minimal patch

  text = re.sub(rf"\bj'{_V}", "je ", text, flags=re.IGNORECASE)
- text = re.sub(rf"\bc'{_V}", "ce ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bd'{_V}", "de ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bqu'{_V}", "que ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bn'{_V}", "ne ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bs'{_V}", "se ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bm'{_V}", "me ", text, flags=re.IGNORECASE)
  text = re.sub(rf"\bt'{_V}", "te ", text, flags=re.IGNORECASE)
- text = re.sub(rf"\bl'{_V}", "le ", text, flags=re.IGNORECASE)
  return text

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/french/operators.py` around lines 119 - 126, The unconditional replacements for rf"\bc'{_V}" and rf"\bl'{_V}" produce ungrammatical expansions (e.g., "c'pas" → "ce pas", "l'voiture" → "le voiture"); remove these two generic lines or replace them with targeted rules that only expand well-known contractions (e.g., match "c'est", "c'était", "c'll?" or a small whitelist) rather than any c' or l' followed by a vowel. Locate the two regex substitutions using rf"\bc'{_V}" and rf"\bl'{_V}" in the normalization/languages/french/operators.py code and either delete them or change them to explicit, whitelist-based patterns to avoid ambiguous expansions.
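One way to realize the whitelist idea from the comment above is an explicit contraction map; the entries below are hypothetical examples for illustration, not the PR's actual list:

```python
import re

# Hypothetical whitelist: expand only contractions known to be safe,
# leaving every other c'/l' occurrence (e.g. "l'voiture") untouched.
WHITELIST = {
    "c'pas": "c'est pas",
    "t'façon": "de toute façon",
}

def expand_whitelisted(text: str) -> str:
    for contraction, expansion in WHITELIST.items():
        # custom boundaries because \b behaves oddly around apostrophes
        text = re.sub(
            rf"(?<!\w){re.escape(contraction)}(?!\w)",
            expansion, text, flags=re.IGNORECASE,
        )
    return text

print(expand_whitelisted("l'voiture est là, c'pas vrai"))
# → l'voiture est là, c'est pas vrai
```

The trade-off is coverage versus safety: anything missing from the map stays contracted, but nothing is ever expanded into ungrammatical text.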
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@normalization/languages/french/operators.py`:
- Around line 55-92: The number_words list is missing standard hyphenated French
numerals which will cause ordinary spellings to be missed; update the
number_words list in normalization/languages/french/operators.py (the
number_words variable) to include the common hyphenated forms: dix-sept,
dix-huit, dix-neuf, soixante-dix, quatre-vingt, quatre-vingts, and
quatre-vingt-dix (and any plural or variant forms you need) so the parser
recognizes these standard spellings; keep entries as lowercase strings
consistent with the existing list.
- Around line 119-126: The unconditional replacements for rf"\bc'{_V}" and
rf"\bl'{_V}" produce ungrammatical expansions (e.g., "c'pas" → "ce pas",
"l'voiture" → "le voiture"); remove these two generic lines or replace them with
targeted rules that only expand well-known contractions (e.g., match "c'est",
"c'était", "c'll?" or a small whitelist) rather than any c' or l' followed by a
vowel. Locate the two regex substitutions using rf"\bc'{_V}" and rf"\bl'{_V}" in
the normalization/languages/french/operators.py code and either delete them or
change them to explicit, whitelist-based patterns to avoid ambiguous expansions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4c70e403-8181-4220-bb97-0a1cd681510b
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
normalization/languages/french/operators.py
What does this PR do?
Enhances the French text normalization pipeline with richer, more consistent output:
- Converts written numbers to digits
- Expands apostrophe contractions
- Adds word/phrase replacements for cleaner normalization
Also adds unit tests with a French operators fixture covering sentence-level replacements and contraction behavior, plus a runtime dependency for number-to-digit conversion.
Type of change
- New language (languages/{lang}/)
- New step (steps/text/ or steps/word/)
- Preset change (presets/)

Checklist
New language
- languages/{lang}/ with operators.py, replacements.py, __init__.py
- replacements.py, not inline in operators.py
- @register_language
- languages/__init__.py
- tests/unit/languages/
- tests/e2e/files/

New step
- name class attribute is unique and matches the YAML key
- @register_step
- steps/text/__init__.py or steps/word/__init__.py
- operators.config.*, no hardcoded language-specific values
- if operators.config.field is None: return text
- steps/text/placeholders.py and pipeline/base.py's validate() is updated
- tests/unit/steps/
- uv run scripts/generate_step_docs.py to regenerate docs/steps.md

Preset change
Tests
Summary by CodeRabbit